Return to Data Visualisation Section
The data comes from Eurocontrol by way of the TidyTuesday project hosted by the R4DS community.
To show only the results, large code chunks are hidden, but can be unfolded by clicking the “Code” boxes on the top right of each hidden code chunk.
The data is fairly large with 688,099 observations and 14 variables.
head(flights)
I can now start investigating questions I asked myself while looking at the data set.
There are 332 European airports in the data set, for all of which there is information on incoming and outgoing flights.
flights %>% count(apt_name)
Before I can plot, I need to extrapolate for 2022, as the most recent data is from May 2022. 2022 not yet being complete, comparing it to fully completed past periods would not make much sense. My approach below is to calculate a percentage of flights out of total flights that have already taken place up until May in each year over all European airports to get a feeling which percentage might be used for extrapolation.
flights %>%
group_by(apt_name, year) %>%
mutate(tot_year = sum(flt_tot_1)) %>%
filter(month_num <= 5,
year < 2020) %>%
group_by(year, apt_name) %>%
summarise(tot_until_may = sum(flt_tot_1),
tot_year = last(tot_year)) %>%
mutate(pct_may = tot_until_may/tot_year) %>%
group_by(year) %>%
summarise(pct_may = mean(pct_may)) %>%
arrange(year)
It looks like around 38% of all flights took place in the first 5 months of each year pre-pandemic. This percentage will be used to roughly approximate the final number for 2022.
flights %>%
count(apt_name, year, wt = flt_tot_1, sort = T, name = "that_year") %>%
add_count(apt_name, wt = that_year, name = "total") %>%
mutate(that_year = case_when(year == 2022 ~ that_year/0.38,
TRUE ~ that_year)) %>%
slice_max(order_by = total, n = 30) %>%
ggplot(aes(year, that_year, colour = apt_name)) +
geom_line(alpha = 0.6) +
geom_point(alpha = 0.6) +
labs(title = "Europe's Five Busiest Airports: Flights Handled By Year",
subtitle = "Unfinished 2022 has been extrapolated using historic traffic patterns.",
y = NULL,
x = "Year",
colour = NULL) +
scale_y_continuous(labels = scales::comma_format()) +
ggsci::scale_colour_jama() +
theme_minimal() +
theme(plot.title = element_text(size = 12, face = "bold", colour = "black"),
plot.subtitle = element_text(face = "italic", colour = "gray50",
size = 10),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
A more technical piece of information that might be interesting to look at is the empirical cumulative distribution function (ECDF) for the volume of flights. For the latter, I need to decide on a year (2019 in this case) and sort the total volume for each airport descending in order to create a cumulative percentage.
flights %>%
filter(year == 2019) %>%
count(apt_name, wt = flt_tot_1, sort = T) %>%
transmute(apt_name = 1:nrow(.), pct = n/sum(n),
pct = cumsum(pct)) %>%
ggplot(aes(apt_name, pct)) +
geom_step(colour = "midnightblue", alpha = 0.6) +
labs(title = "ECDF Of Total Flights Handled In 2018",
subtitle = "Airports are sorted descending by volume on the x-axis",
y = NULL,
x = "Airports") +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal() +
theme(plot.title = element_text(size = 12, face = "bold", colour = "black"),
plot.subtitle = element_text(face = "italic", colour = "gray50",
size = 10))
Interestingly, about one third of all European airports handle more than 80% of flight traffic and 15% of all European airports handle more than 60% of all flight traffic. This distribution shows the strong draw of huge flight hubs in Europe, where large parts of the population travel long distances by plane or other modes of transport before even stepping on the plane taking them to their final destination.
In order to find intra-year patterns, I want to look at average daily flights handled on a monthly aggregation basis. Given the large number of over 300 airports, I want to see the ones with the highest fluctuation of that metric, which will likely be the ones with highest seasonality.
flights %>%
group_by(apt_name, apt_icao, month_num) %>%
summarise(mean_daily = mean(flt_tot_1),
lower_daily = quantile(flt_tot_1, 0.25),
higher_daily = quantile(flt_tot_1, 0.75)) %>%
ungroup() %>%
group_by(apt_name, apt_icao) %>%
summarise(volatility_daily = sd(mean_daily),
total = sum(mean_daily)) %>%
arrange(desc(volatility_daily))
It looks like none of them are due to very low numbers of flights, so I can proceed to make the visualisation.
flights %>%
mutate(month = lubridate::month(flt_date, label = T)) %>%
filter(apt_icao %in% c("LEPA", "LTAI", "LGAV", "LEIB")) %>%
group_by(apt_name, month) %>%
summarise(mean_daily = mean(flt_tot_1),
lower_daily = quantile(flt_tot_1, 0.25),
higher_daily = quantile(flt_tot_1, 0.75)) %>%
ggplot(aes(month, mean_daily, group = apt_name)) +
geom_line(aes(colour = apt_name)) +
geom_ribbon(aes(ymin = lower_daily, ymax = higher_daily, fill = apt_name),
alpha = 0.15) +
labs(title = "The Most Seasonal European Airports Are Summer Destinations",
subtitle = "Confidence bands show 25th and 75th percentile.\nValues shown are average daily flights handled by month.",
y = "Daily Flights Handled",
x = NULL,
colour = "Airport") +
guides(fill = "none") +
ggsci::scale_colour_jama() +
ggsci::scale_fill_jama() +
scale_y_continuous(labels = scales::comma_format()) +
expand_limits(y = 0) +
theme_minimal() +
theme(plot.title = element_text(size = 12, face = "bold", colour = "black"),
plot.subtitle = element_text(face = "italic", colour = "gray50",
size = 10),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
flights %>%
mutate(month = lubridate::month(flt_date, label = T)) %>%
filter(apt_icao %in% c("EGLL", "ELLX", "LSGG", "EDDT", "ESSA")) %>%
group_by(apt_name, month) %>%
summarise(mean_daily = mean(flt_tot_1),
lower_daily = quantile(flt_tot_1, 0.25),
higher_daily = quantile(flt_tot_1, 0.75)) %>%
ggplot(aes(month, mean_daily, group = apt_name)) +
geom_line(aes(colour = apt_name)) +
geom_ribbon(aes(ymin = lower_daily, ymax = higher_daily, fill = apt_name),
alpha = 0.15) +
labs(title = "The Least Seasonal European Airports Are In Central\nEurope And Often Places Of Work Or Business",
subtitle = "Confidence bands show 25th and 75th percentile.\nValues shown are average daily flights handled by month.",
y = "Daily Flights Handled",
x = NULL,
colour = "Airport") +
guides(fill = "none") +
ggsci::scale_colour_jama() +
ggsci::scale_fill_jama() +
scale_y_continuous(labels = scales::comma_format()) +
expand_limits(y = 0) +
theme_minimal() +
theme(plot.title = element_text(size = 12, face = "bold", colour = "black"),
plot.subtitle = element_text(face = "italic", colour = "gray50",
size = 10),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
Lastly, it would be interesting to see which destinations get a much stronger influx of people than throughout the rest of the year. This will be an indication of people heading there to spend time over Christmas holidays, in turn revealing the most popular spots for spending Christmas or even Silvester that are mostly reached by plane.
flights %>%
mutate(week = lubridate::week(flt_date),
week = case_when(week %in% c(51, 52) ~ "Silvester",
TRUE ~ "Rest")) %>%
group_by(apt_name, state_name, week) %>%
summarise(arriving = mean(flt_arr_1)) %>%
group_by(apt_name) %>%
mutate(total = sum(arriving)) %>%
ungroup() %>%
filter(total > 70) %>%
select(-total) %>%
ungroup() %>%
pivot_wider(values_from = arriving, names_from = week) %>%
mutate(silvester_ratio = Silvester/Rest - 1) %>%
arrange(desc(silvester_ratio)) %>%
filter(silvester_ratio > 0.05) %>%
mutate(apt_name = paste0(apt_name, " (", state_name, ")"),
apt_name = factor(apt_name) %>% fct_reorder(silvester_ratio)) %>%
select(-c(Rest, Silvester)) %>%
ggplot(aes(silvester_ratio, apt_name)) +
geom_point(colour = "midnightblue", size = 3, alpha = 0.6) +
geom_segment(aes(x = 0, xend = silvester_ratio,
y = apt_name, yend = apt_name),
colour = "midnightblue",
lty = "dashed", alpha = 0.6) +
geom_vline(xintercept = 0, lty = "dashed", colour = "midnightblue",
alpha = 0.6) +
labs(title = "Warmer European Destinations Are Popular Over Christmas Holidays",
subtitle = "Values show growth of daily arrivals in the last two weeks relative to the rest of the year.",
y = NULL,
x = NULL) +
scale_x_continuous(labels = scales::percent_format()) +
expand_limits(x = 0) +
theme_minimal() +
theme(plot.title = element_text(size = 12, face = "bold", colour = "black"),
plot.subtitle = element_text(face = "italic", colour = "gray50",
size = 10),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
Clearly, most regions with more arrivals for the festive days are in Southern Europe, presumably because of the climate.
There was not much more to be squeezed out of the small amount of variables, but I hope you enjoyed it and learned something new from this data.
Thank you for reading and feel free to reach out!
A work by Mathias Steilen